This quickstart guide explains how to join two tables A and B using edit distance measure. First, you need to import the required packages as follows (if you have installed py_stringsimjoin it will automatically install the dependencies py_stringmatching and pandas):



In [20]:

    
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os, sys



In [21]:

    
print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)









    



python version: 2.7.12 |Anaconda 2.5.0 (x86_64)| (default, Jul  2 2016, 17:43:17) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.2.0
pandas version: 0.17.1

Joining two tables using edit distance measure typically consists of three steps:

Loading the input tables
Profiling the tables
Performing the join

1. Loading the input tables

We begin by loading the two tables. For the purpose of this guide, we use the sample dataset that comes with the package.



In [22]:

    
# construct the path of the tables to be loaded. Since we are loading a 
# dataset from the package, we need to access the data from the path 
# where the package is installed. If you need to load your own data, you can directly
# provide your table path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_A.csv'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_B.csv'])



In [23]:

    
# Load csv files as dataframes.
A = pd.read_csv(table_A_path)
B = pd.read_csv(table_B_path)
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))









    



Number of records in A: 7
Number of records in B: 8



In [24]:

    
A









    Out[24]:






  
    
      
      A.id
      A.name
      A.birth_year
      A.hourly_wage
      A.address
      A.zipcode
    
  
  
    
      0
      a1
      Kevin Smith
      1989
      30.0
      607 From St, San Francisco
      94107
    
    
      1
      a2
      Michael Franklin
      1988
      27.5
      1652 Stockton St, San Francisco
      94122
    
    
      2
      a3
      William Bridge
      1986
      32.0
      3131 Webster St, San Francisco
      94107
    
    
      3
      a4
      Binto George
      1987
      32.5
      423 Powell St, San Francisco
      94122
    
    
      4
      a5
      Alphonse Kemper
      1984
      35.0
      1702 Post Street, San Francisco
      94122
    
    
      5
      a6
      
      1990
      40.0
      24th Street, San Francisco
      94122
    
    
      6
      a7
      NaN
      1986
      25.0
      20th Street, San Francisco
      NaN



In [25]:

    
B









    Out[25]:






  
    
      
      B.id
      B.name
      B.birth_year
      B.hourly_wage
      B.address
      B.zipcode
    
  
  
    
      0
      b1
      Mark Levene
      1987
      29.5
      108 Clement St, San Francisco
      94107
    
    
      1
      b2
      Bill Bridge
      1986
      32.0
      3131 Webster St, San Francisco
      94107
    
    
      2
      b3
      Mike Franklin
      1988
      27.5
      1652 Stockton St, San Francisco
      94122
    
    
      3
      b4
      Joseph Kuan
      1982
      26.0
      108 South Park, San Francisco
      94122
    
    
      4
      b5
      Alfons Kemper
      1984
      35.0
      170 Post St, Apt 4,  San Francisco
      94122
    
    
      5
      b6
      Michael Brodie
      1987
      32.5
      133 Clement Street, San Francisco
      94107
    
    
      6
      b7
      
      1990
      40.0
      24th Street, San Francisco
      94122
    
    
      7
      b8
      NaN
      1986
      25.0
      20th Street, San Francisco
      NaN

2. Profiling the tables

Before performing the join, we may want to profile the tables to know about the characteristics of the attributes. This can help identify:

a) unique attributes in the table which can be used as key attribute when performing the join. A key attribute is needed to uniquely identify a tuple.

b) the number of missing values present in each attribute. This can help you in deciding the attribute on which to perform the join. For example, an attribute with a lot of missing values may not be a good join attribute. Further, based on the missing value information you need to decide on how to handle missing values when performing the join (See the section below on 'Handling missing values' to know more about the options available for handling missing values when performing the join).

You can profile the attributes in a table using the following command:



In [26]:

    
# profile attributes in table A
ssj.profile_table_for_join(A)









    Out[26]:






  
    
      
      Unique values
      Missing values
      Comments
    
    
      Attribute
      
      
      
    
  
  
    
      A.id
      7 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      A.name
      7 (100.0%)
      1 (14.29%)
      Joining on this attribute will ignore 1 (14.29%) rows.
    
    
      A.birth_year
      6 (85.71%)
      0 (0.0%)
      
    
    
      A.hourly_wage
      7 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      A.address
      7 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      A.zipcode
      3 (42.86%)
      1 (14.29%)
      Joining on this attribute will ignore 1 (14.29%) rows.



In [27]:

    
# profile attributes in table B
ssj.profile_table_for_join(B)









    Out[27]:






  
    
      
      Unique values
      Missing values
      Comments
    
    
      Attribute
      
      
      
    
  
  
    
      B.id
      8 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      B.name
      8 (100.0%)
      1 (12.5%)
      Joining on this attribute will ignore 1 (12.5%) rows.
    
    
      B.birth_year
      6 (75.0%)
      0 (0.0%)
      
    
    
      B.hourly_wage
      8 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      B.address
      8 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      B.zipcode
      3 (37.5%)
      1 (12.5%)
      Joining on this attribute will ignore 1 (12.5%) rows.

If the input tables does not contain any key attribute, then you need to create a key attribute. In the current example, both the input tables A and B have key attributes, and hence you can proceed to the next step. In the case the table does not have any key attribute, you can add a key attribute using the following command:



In [28]:

    
B['new_key_attr'] = range(0, len(B))
B









    Out[28]:






  
    
      
      B.id
      B.name
      B.birth_year
      B.hourly_wage
      B.address
      B.zipcode
      new_key_attr
    
  
  
    
      0
      b1
      Mark Levene
      1987
      29.5
      108 Clement St, San Francisco
      94107
      0
    
    
      1
      b2
      Bill Bridge
      1986
      32.0
      3131 Webster St, San Francisco
      94107
      1
    
    
      2
      b3
      Mike Franklin
      1988
      27.5
      1652 Stockton St, San Francisco
      94122
      2
    
    
      3
      b4
      Joseph Kuan
      1982
      26.0
      108 South Park, San Francisco
      94122
      3
    
    
      4
      b5
      Alfons Kemper
      1984
      35.0
      170 Post St, Apt 4,  San Francisco
      94122
      4
    
    
      5
      b6
      Michael Brodie
      1987
      32.5
      133 Clement Street, San Francisco
      94107
      5
    
    
      6
      b7
      
      1990
      40.0
      24th Street, San Francisco
      94122
      6
    
    
      7
      b8
      NaN
      1986
      25.0
      20th Street, San Francisco
      NaN
      7

For the purpose of this guide, we will now join tables A and B on 'name' attribute using edit distance measure. Next, we need to decide on what threshold to use for the join. For this guide, we will use a threshold of 5. Specifically, the join will now find tuple pairs from A and B such that the edit distance over the 'name' attributes is at most 5.

3. Performing the join

The next step is to perform the edit distance join using the following command:



In [29]:

    
# find all pairs from A and B such that the edit distance
# on 'name' is at most 5.
# l_out_attrs and r_out_attrs denote the attributes from the 
# left table (A) and right table (B) that need to be included in the output.

output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])









    



0%   100%
[#######] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [30]:

    
len(output_pairs)









    Out[30]:





4



In [31]:

    
# examine the output pairs
output_pairs









    Out[31]:






  
    
      
      _id
      l_A.id
      r_B.id
      l_A.name
      r_B.name
      _sim_score
    
  
  
    
      0
      0
      a3
      b2
      William Bridge
      Bill Bridge
      4
    
    
      1
      1
      a2
      b3
      Michael Franklin
      Mike Franklin
      4
    
    
      2
      2
      a5
      b5
      Alphonse Kemper
      Alfons Kemper
      3
    
    
      3
      3
      a6
      b7
      
      
      0

Handling missing values

By default, pairs with missing values are not included in the output. This is because a string with a missing value can potentially match with all strings in the other table and hence the number of output pairs can become huge. If you want to include pairs with missing value in the output, you need to set the allow_missing flag to True, as shown below:



In [32]:

    
output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, allow_missing=True, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])









    



0%   100%
[#######] | ETA: 00:00:00
Total time elapsed: 00:00:00
0%  100%
[##] | ETA: 00:00:00





    



Finding pairs with missing value...






    



Total time elapsed: 00:00:00



In [33]:

    
output_pairs









    Out[33]:






  
    
      
      _id
      l_A.id
      r_B.id
      l_A.name
      r_B.name
      _sim_score
    
  
  
    
      0
      0
      a3
      b2
      William Bridge
      Bill Bridge
      4
    
    
      1
      1
      a2
      b3
      Michael Franklin
      Mike Franklin
      4
    
    
      2
      2
      a5
      b5
      Alphonse Kemper
      Alfons Kemper
      3
    
    
      3
      3
      a6
      b7
      
      
      0
    
    
      0
      4
      a7
      b1
      NaN
      Mark Levene
      NaN
    
    
      1
      5
      a7
      b2
      NaN
      Bill Bridge
      NaN
    
    
      2
      6
      a7
      b3
      NaN
      Mike Franklin
      NaN
    
    
      3
      7
      a7
      b4
      NaN
      Joseph Kuan
      NaN
    
    
      4
      8
      a7
      b5
      NaN
      Alfons Kemper
      NaN
    
    
      5
      9
      a7
      b6
      NaN
      Michael Brodie
      NaN
    
    
      6
      10
      a7
      b7
      NaN
      
      NaN
    
    
      7
      11
      a7
      b8
      NaN
      NaN
      NaN
    
    
      8
      12
      a1
      b8
      Kevin Smith
      NaN
      NaN
    
    
      9
      13
      a2
      b8
      Michael Franklin
      NaN
      NaN
    
    
      10
      14
      a3
      b8
      William Bridge
      NaN
      NaN
    
    
      11
      15
      a4
      b8
      Binto George
      NaN
      NaN
    
    
      12
      16
      a5
      b8
      Alphonse Kemper
      NaN
      NaN
    
    
      13
      17
      a6
      b8
      
      NaN
      NaN

Enabling parallel processing

If you have multiple cores which you want to exploit for performing the join, you need to use the n_jobs option. If n_jobs is -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging and is the default option. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).

The following command exploits all the cores available to perform the join:



In [34]:

    
output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'], n_jobs=-1)









    



0%  100%
[##] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [35]:

    
len(output_pairs)









    Out[35]:





4

You need to set n_jobs to 1 when you are debugging or you do not want to use any parallel computing code. If you want to execute the join as fast as possible, you need to set n_jobs to -1 which will exploit all the CPUs in your machine. In case there are other concurrent processes running in your machine and you do not want to halt them, then you may need to set n_jobs to a value below -1.

Performing join on numeric attributes

The join method expects the join attributes to be of string type. If you need to perform the join over numeric attributes, then you need to first convert the attributes to string type and then perform the join. For example, if you need to join 'A.zipcode' in table A with 'B.zipcode' in table B, you need to first convert the attributes to string type using the following command:



In [36]:

    
ssj.dataframe_column_to_str(A, 'A.zipcode', inplace=True)
ssj.dataframe_column_to_str(B, 'B.zipcode', inplace=True)









    Out[36]:





True

Note that the above command preserves the NaN values while converting the numeric column to string type. Next, you can perform the join as shown below:



In [37]:

    
output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.zipcode', 'B.zipcode', 1, 
                                      l_out_attrs=['A.zipcode'], r_out_attrs=['B.zipcode'])
output_pairs









    



0%   100%
[#######] | ETA: 00:00:00
Total time elapsed: 00:00:00






    Out[37]:






  
    
      
      _id
      l_A.id
      r_B.id
      l_A.zipcode
      r_B.zipcode
      _sim_score
    
  
  
    
      0
      0
      a1
      b1
      94107
      94107
      0
    
    
      1
      1
      a3
      b1
      94107
      94107
      0
    
    
      2
      2
      a1
      b2
      94107
      94107
      0
    
    
      3
      3
      a3
      b2
      94107
      94107
      0
    
    
      4
      4
      a2
      b3
      94122
      94122
      0
    
    
      5
      5
      a4
      b3
      94122
      94122
      0
    
    
      6
      6
      a5
      b3
      94122
      94122
      0
    
    
      7
      7
      a6
      b3
      94122
      94122
      0
    
    
      8
      8
      a2
      b4
      94122
      94122
      0
    
    
      9
      9
      a4
      b4
      94122
      94122
      0
    
    
      10
      10
      a5
      b4
      94122
      94122
      0
    
    
      11
      11
      a6
      b4
      94122
      94122
      0
    
    
      12
      12
      a2
      b5
      94122
      94122
      0
    
    
      13
      13
      a4
      b5
      94122
      94122
      0
    
    
      14
      14
      a5
      b5
      94122
      94122
      0
    
    
      15
      15
      a6
      b5
      94122
      94122
      0
    
    
      16
      16
      a1
      b6
      94107
      94107
      0
    
    
      17
      17
      a3
      b6
      94107
      94107
      0
    
    
      18
      18
      a2
      b7
      94122
      94122
      0
    
    
      19
      19
      a4
      b7
      94122
      94122
      0
    
    
      20
      20
      a5
      b7
      94122
      94122
      0
    
    
      21
      21
      a6
      b7
      94122
      94122
      0

Additional options

You can find all the options available for the edit distance join function using the help command as shown below:



In [ ]:

    
help(ssj.edit_distance_join)

More information

Similar to edit distance measure, you can use the package to perform join using other measures such as cosine, Dice, Jaccard, overlap and overlap coefficient. For measures such as TF-IDF which are not directly supported, you can perform the join using the filters provided in the package. To know more about other join methods as well as how to use filters, refer to the how-to guide (available from the package homepage).

	A.id	A.name	A.birth_year	A.hourly_wage	A.address	A.zipcode
0	a1	Kevin Smith	1989	30.0	607 From St, San Francisco	94107
1	a2	Michael Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
2	a3	William Bridge	1986	32.0	3131 Webster St, San Francisco	94107
3	a4	Binto George	1987	32.5	423 Powell St, San Francisco	94122
4	a5	Alphonse Kemper	1984	35.0	1702 Post Street, San Francisco	94122
5	a6		1990	40.0	24th Street, San Francisco	94122
6	a7	NaN	1986	25.0	20th Street, San Francisco	NaN

	B.id	B.name	B.birth_year	B.hourly_wage	B.address	B.zipcode
0	b1	Mark Levene	1987	29.5	108 Clement St, San Francisco	94107
1	b2	Bill Bridge	1986	32.0	3131 Webster St, San Francisco	94107
2	b3	Mike Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
3	b4	Joseph Kuan	1982	26.0	108 South Park, San Francisco	94122
4	b5	Alfons Kemper	1984	35.0	170 Post St, Apt 4, San Francisco	94122
5	b6	Michael Brodie	1987	32.5	133 Clement Street, San Francisco	94107
6	b7		1990	40.0	24th Street, San Francisco	94122
7	b8	NaN	1986	25.0	20th Street, San Francisco	NaN

	Unique values	Missing values	Comments
Attribute
A.id	7 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
A.name	7 (100.0%)	1 (14.29%)	Joining on this attribute will ignore 1 (14.29%) rows.
A.birth_year	6 (85.71%)	0 (0.0%)
A.hourly_wage	7 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
A.address	7 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
A.zipcode	3 (42.86%)	1 (14.29%)	Joining on this attribute will ignore 1 (14.29%) rows.

	Unique values	Missing values	Comments
Attribute
B.id	8 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
B.name	8 (100.0%)	1 (12.5%)	Joining on this attribute will ignore 1 (12.5%) rows.
B.birth_year	6 (75.0%)	0 (0.0%)
B.hourly_wage	8 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
B.address	8 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
B.zipcode	3 (37.5%)	1 (12.5%)	Joining on this attribute will ignore 1 (12.5%) rows.